US Election 2020 Tweets

In 2020, the United States held its 59th presidential election. It took place on November 3: Joe Biden stood as the Democratic candidate, while Donald Trump ran for re-election on the Republican side.

The 2020 United States presidential election was accompanied by exceptional economic and social events that could have had a significant impact on voters' emotions. These include the COVID-19 pandemic, the racial unrest that followed the murder of George Floyd, and numerous protests. What is more, the presidential debates touched on economic plans, in particular tax policy. Other socially sensitive issues were also discussed, such as environmental policy in the light of ongoing climate change and the health-care program known as Obamacare.

Many of the events that stirred emotions in 2020 had a political dimension, and the division of voters mainly between two parties, the Democrats and the Republicans, only exacerbated the conflicts.

Is it possible to evaluate voters' sentiment through their posts on the Twitter platform? Can the two camps behind Joe Biden and Donald Trump be distinguished by the emotions expressed on social media? Does either group of voters express clearly more extreme emotions?

For this purpose, a data set containing hundreds of thousands of Twitter posts with the hashtag #JoeBiden or #DonaldTrump was analyzed.

Import libraries

In [1]:
import pandas as pd 
import numpy as np 
%matplotlib inline 

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

analyser = SentimentIntensityAnalyzer()

from wordcloud import WordCloud,STOPWORDS

import matplotlib.pyplot as plt 
import plotly.express as px

Joe Biden Tweets

First, the data structure was inspected, and redundant columns and N/A rows were removed.

In [2]:
data_biden = pd.read_csv('./Election data/hashtag_joebiden.csv', lineterminator='\n', parse_dates=True)
data_biden.head()
Out[2]:
created_at tweet_id tweet likes retweet_count source user_id user_name user_screen_name user_description ... user_followers_count user_location lat long city country continent state state_code collected_at
0 2020-10-15 00:00:01 1.316529e+18 #Elecciones2020 | En #Florida: #JoeBiden dice ... 0.0 0.0 TweetDeck 3.606665e+08 El Sol Latino News elsollatinonews 🌐 Noticias de interés para latinos de la costa... ... 1860.0 Philadelphia, PA / Miami, FL 25.774270 -80.193660 NaN United States of America North America Florida FL 2020-10-21 00:00:00
1 2020-10-15 00:00:18 1.316529e+18 #HunterBiden #HunterBidenEmails #JoeBiden #Joe... 0.0 0.0 Twitter for iPad 8.099044e+08 Cheri A. 🇺🇸 Biloximeemaw Locked and loaded Meemaw. Love God, my family ... ... 6628.0 NaN NaN NaN NaN NaN NaN NaN NaN 2020-10-21 00:00:00.517827283
2 2020-10-15 00:00:20 1.316529e+18 @IslandGirlPRV @BradBeauregardJ @MeidasTouch T... 0.0 0.0 Twitter Web App 3.494182e+09 Flag Waver Flag_Wavers NaN ... 1536.0 Golden Valley Arizona 46.304036 -109.171431 NaN United States of America North America Montana MT 2020-10-21 00:00:01.035654566
3 2020-10-15 00:00:21 1.316529e+18 @chrislongview Watching and setting dvr. Let’s... 0.0 0.0 Twitter for iPhone 8.242596e+17 Michelle Ferg MichelleFerg4 NaN ... 27.0 NaN NaN NaN NaN NaN NaN NaN NaN 2020-10-21 00:00:01.553481849
4 2020-10-15 00:00:22 1.316529e+18 #censorship #HunterBiden #Biden #BidenEmails #... 1.0 0.0 Twitter Web App 1.032807e+18 the Gold State theegoldstate A Silicon Valley #independent #News #Media #St... ... 390.0 California, USA 36.701463 -118.755997 NaN United States of America North America California CA 2020-10-21 00:00:02.071309132

5 rows × 21 columns

In [3]:
data_biden = data_biden.dropna()
data_biden = data_biden[['created_at', 'tweet']]
data_biden.rename(columns={'created_at': 'Timestamp', 'tweet': 'Text'}, inplace=True)
data_biden.to_csv("./Election data/data_joebiden.csv", index=None)
In [4]:
txtm_biden = pd.read_csv('./Election data/data_joebiden.csv')
txtm_biden['Text'] = txtm_biden['Text'].astype(str)
print(txtm_biden.shape)
txtm_biden.head()
(155955, 2)
Out[4]:
Timestamp Text
0 2020-10-15 00:00:25 In 2020, #NYPost is being #censorship #CENSORE...
1 2020-10-15 00:01:23 Comments on this? "Do Democrats Understand how...
2 2020-10-15 00:01:57 @RealJamesWoods #BidenCrimeFamily #JoeBiden #H...
3 2020-10-15 00:02:05 #Trump #Obama #Clinton #Biden\n\n#ManWomanPers...
4 2020-10-15 00:02:06 Come on @ABC PLEASE DO THE RIGHT THING. Move t...

From the NLTK package, we use functions that calculate sentiment scores. The VADER (Valence Aware Dictionary and sEntiment Reasoner) method was chosen.

In [5]:
%%time
i=0
compval1 = [ ]

while (i<len(txtm_biden)):

    k = analyser.polarity_scores(txtm_biden.iloc[i]['Text'])
    compval1.append(k['compound'])

    i = i+1

compval1 = np.array(compval1)

# The result of the operation is added to our data set
txtm_biden['VADER score'] = compval1

The VADER compound score is interpreted as follows: values at or below 0 are classified as negative, values between 0 and 0.7 as neutral, and the Author decided to classify those at or above 0.7 as strongly positive. Note that a compound of exactly 0 (no sentiment detected) is counted as negative here, which inflates the negative class.
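The threshold rule can also be applied in a vectorized way instead of a row-by-row loop; a minimal sketch with `np.select`, using invented toy scores:

```python
import numpy as np

scores = np.array([-0.41, 0.0, 0.35, 0.72])
# conditions are checked in order, mirroring the if/elif chain:
# >= 0.7 -> positive, > 0 -> neutral, everything else (<= 0) -> negative
labels = np.select([scores >= 0.7, scores > 0],
                   ['positive', 'neutral'],
                   default='negative')
print(labels.tolist())
```

This produces the same labels as the loop but in a single pass over the array.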

In [6]:
%%time
i = 0

predicted_value = [ ] # empty list to hold our predicted values

while(i<len(txtm_biden)):
    if ((txtm_biden.iloc[i]['VADER score'] >= 0.7)):
        predicted_value.append('positive')
        i = i+1
    elif ((txtm_biden.iloc[i]['VADER score'] > 0) & (txtm_biden.iloc[i]['VADER score'] < 0.7)):
        predicted_value.append('neutral')
        i = i+1
    elif ((txtm_biden.iloc[i]['VADER score'] <= 0)):
        predicted_value.append('negative')
        i = i+1
In [7]:
# Add predicted values to our data set and separate days from the 'Timestamp' column
txtm_biden['predicted sentiment'] = predicted_value
txtm_biden['Date'] = txtm_biden['Timestamp'].apply(lambda s: s.split()[0])

# We save our data set with evaluated scores for later
txtm_biden.to_csv("./Election data/data_joebiden_sent.csv", index=None)
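The date is extracted above with a plain string split; an equivalent sketch using pandas' datetime accessor (toy timestamps invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({'Timestamp': ['2020-10-15 00:00:25', '2020-11-03 12:30:00']})
# parse once, then format back to YYYY-MM-DD; equivalent to s.split()[0]
# for this timestamp layout, but robust to other input formats
df['Date'] = pd.to_datetime(df['Timestamp']).dt.strftime('%Y-%m-%d')
print(df['Date'].tolist())
```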
In [8]:
txtm_biden_sentiment = pd.read_csv('./Election data/data_joebiden_sent.csv')
txtm_biden_sentiment.head(5)
Out[8]:
Timestamp Text VADER score predicted sentiment Date
0 2020-10-15 00:00:25 In 2020, #NYPost is being #censorship #CENSORE... -0.4137 negative 2020-10-15
1 2020-10-15 00:01:23 Comments on this? "Do Democrats Understand how... 0.0000 negative 2020-10-15
2 2020-10-15 00:01:57 @RealJamesWoods #BidenCrimeFamily #JoeBiden #H... 0.0000 negative 2020-10-15
3 2020-10-15 00:02:05 #Trump #Obama #Clinton #Biden\n\n#ManWomanPers... 0.0000 negative 2020-10-15
4 2020-10-15 00:02:06 Come on @ABC PLEASE DO THE RIGHT THING. Move t... 0.7241 positive 2020-10-15

We will use the VADER score and the time-structured data to check whether the sentiment of the tweets changes over time.

Public opinion is shaped by the information that reaches it, and that information can be manipulated (Max Weber, 1988). Such activity may intensify during an election campaign, when political parties become more active and their actions attract increased media interest.

Since public opinion does not exist per se, political communication becomes possible (Eric Maigret, 2012).

Considering the above, politicians use various methods of communication and opinion transmission, which may carry a certain level of sentiment, examined in this work.

How does the sentiment of the content respond to the upcoming election, and what is the share of tweets classified as positive compared to negative before and after the November 3 election? The chart also marks November 6, when the so-called 'Swing States' were already identifiable.

In [9]:
txtm_biden_sentiment.groupby('predicted sentiment').size().plot(kind='bar')
Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fa3b088f2d0>
In [10]:
#data preparation
txtm_biden_temp = txtm_biden_sentiment.groupby(['Date','predicted sentiment']).count().reset_index()
txtm_biden_temp = txtm_biden_temp[['Date', 'predicted sentiment', 'Timestamp']]
txtm_biden_temp.rename(columns={'Date': 'Date', 'predicted sentiment': 'Sentiment','Timestamp': 'Count'}, inplace=True)
txtm_biden_temp = np.array(txtm_biden_temp)
txtm_biden_temp = pd.DataFrame(txtm_biden_temp[3:78])
txtm_biden_temp.rename(columns={0: 'Date', 1: 'Sentiment',2: 'Count'}, inplace=True)
txtm_biden_temp.head(10)
Out[10]:
Date Sentiment Count
0 2020-10-15 negative 2086
1 2020-10-15 neutral 628
2 2020-10-15 positive 201
3 2020-10-16 negative 2480
4 2020-10-16 neutral 1048
5 2020-10-16 positive 430
6 2020-10-17 negative 1659
7 2020-10-17 neutral 559
8 2020-10-17 positive 176
9 2020-10-18 negative 1405
In [11]:
# a crosstabbed data structure is needed for this type of plot
p = np.array(txtm_biden_temp['Date'])
q = np.array(txtm_biden_temp['Sentiment'])
r = np.array(txtm_biden_temp['Count'])
crossed_data = pd.crosstab(p, columns=q, values=r, aggfunc='sum', rownames=['Date'], colnames=['Sentiment'])
In [12]:
crossed_data['sum'] = crossed_data['negative']+crossed_data['neutral']+crossed_data['positive']
crossed_data['pos_per'] = crossed_data['positive']/crossed_data['sum']
crossed_data['neg_per'] = crossed_data['negative']/crossed_data['sum']
crossed_data['neu_per'] = crossed_data['neutral']/crossed_data['sum']
crossed_data['date'] = crossed_data.index

crossed_data = np.array(crossed_data)
crossed_data = pd.DataFrame(crossed_data)

Basic indicators per day have been calculated and are shown below:

  • Number of positive Tweets according to the calculated VADER indicator,
  • Number of negative Tweets according to the calculated VADER indicator,
  • Number of neutral Tweets,
  • Percentage of positive to total Tweets,
  • Percentage of negative tweets to total,
  • Percentage of neutral tweets.
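The per-day shares above can also be computed compactly with a pivot and row-normalization, without the numpy round-trip; a sketch using the first two days of counts from the table shown earlier:

```python
import pandas as pd

counts = pd.DataFrame({
    'Date': ['2020-10-15'] * 3 + ['2020-10-16'] * 3,
    'Sentiment': ['negative', 'neutral', 'positive'] * 2,
    'Count': [2086, 628, 201, 2480, 1048, 430],
})
# wide table: one row per day, one column per sentiment
wide = counts.pivot(index='Date', columns='Sentiment', values='Count')
# divide each row by its row sum to obtain the daily shares
shares = wide.div(wide.sum(axis=1), axis=0)
print(shares.round(3))
```

The resulting shares match the Posper/Negper/Neuper columns computed above.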
In [13]:
crossed_data.rename(columns={0: 'Negative', 1: 'Neutral',2: 'Positive',3: 'Sum', 4: 'Posper',5: 'Negper',6: 'Neuper',7:'Date'}, inplace=True)

crossed_data.head(5)
Out[13]:
Negative Neutral Positive Sum Posper Negper Neuper Date
0 2086 628 201 2915 0.0689537 0.715609 0.215437 2020-10-15
1 2480 1048 430 3958 0.108641 0.626579 0.26478 2020-10-16
2 1659 559 176 2394 0.0735171 0.692982 0.2335 2020-10-17
3 1405 544 197 2146 0.0917987 0.654706 0.253495 2020-10-18
4 1606 619 201 2426 0.0828524 0.661995 0.255153 2020-10-19
In [14]:
df = crossed_data[['Date','Posper','Negper','Neuper']]


# use a separate name so matplotlib's `plt` is not shadowed
ax = df.plot.area(x='Date', y=['Posper','Negper','Neuper'], colormap='winter', figsize=(10,7))

ax.axvline(x=22, color='r', linestyle='--', lw=3, label='6 November - Swing States identifiable')
ax.axvline(x=19, color='y', linestyle='--', lw=3, label='3 November - US Election Day')

ax.set_xlabel('')
ax.set_ylabel('Predicted Sentiment Share')

ax.legend(bbox_to_anchor=(1.0, 1), loc='upper left')
Out[14]:
<matplotlib.legend.Legend at 0x7fa368a8e390>

From the chart, one can observe some increase in negative tweets, perhaps a result of signs that Joe Biden might lose.

On the other hand, on November 6, when it was already possible to see which states would decide who becomes the next president of the United States, the share of negative tweets fell to its lowest level in the entire period. Could a reversal of sentiment be inferred from this moment?

One can definitely observe a 'thickening' of moods and of posting frequency: the largest number of tweets appeared after November 3 (see the plot below).

In [15]:
fig1 = px.scatter(txtm_biden_sentiment, x="Timestamp",
               y="VADER score",
               hover_data=["VADER score"],
               color_discrete_sequence=["lightseagreen", "indianred", "mediumpurple"],
               color="predicted sentiment",
               size_max=10,
               title=f"Biden Tweets"
          )

fig1

Donald Trump Tweets

The whole data preparation process was very similar to Joe Biden's part: the data structure was inspected, and redundant columns and N/A rows were removed.

In [16]:
data_trump = pd.read_csv('./Election data/hashtag_donaldtrump.csv', lineterminator='\n', parse_dates=True)
data_trump.head()
Out[16]:
created_at tweet_id tweet likes retweet_count source user_id user_name user_screen_name user_description ... user_followers_count user_location lat long city country continent state state_code collected_at
0 2020-10-15 00:00:01 1.316529e+18 #Elecciones2020 | En #Florida: #JoeBiden dice ... 0.0 0.0 TweetDeck 3.606665e+08 El Sol Latino News elsollatinonews 🌐 Noticias de interés para latinos de la costa... ... 1860.0 Philadelphia, PA / Miami, FL 25.774270 -80.193660 NaN United States of America North America Florida FL 2020-10-21 00:00:00
1 2020-10-15 00:00:01 1.316529e+18 Usa 2020, Trump contro Facebook e Twitter: cop... 26.0 9.0 Social Mediaset 3.316176e+08 Tgcom24 MediasetTgcom24 Profilo ufficiale di Tgcom24: tutte le notizie... ... 1067661.0 NaN NaN NaN NaN NaN NaN NaN NaN 2020-10-21 00:00:00.373216530
2 2020-10-15 00:00:02 1.316529e+18 #Trump: As a student I used to hear for years,... 2.0 1.0 Twitter Web App 8.436472e+06 snarke snarke Will mock for food! Freelance writer, blogger,... ... 1185.0 Portland 45.520247 -122.674195 Portland United States of America North America Oregon OR 2020-10-21 00:00:00.746433060
3 2020-10-15 00:00:02 1.316529e+18 2 hours since last tweet from #Trump! Maybe he... 0.0 0.0 Trumpytweeter 8.283556e+17 Trumpytweeter trumpytweeter If he doesn't tweet for some time, should we b... ... 32.0 NaN NaN NaN NaN NaN NaN NaN NaN 2020-10-21 00:00:01.119649591
4 2020-10-15 00:00:08 1.316529e+18 You get a tie! And you get a tie! #Trump ‘s ra... 4.0 3.0 Twitter for iPhone 4.741380e+07 Rana Abtar - رنا أبتر Ranaabtar Washington Correspondent, Lebanese-American ,c... ... 5393.0 Washington DC 38.894992 -77.036558 Washington United States of America North America District of Columbia DC 2020-10-21 00:00:01.492866121

5 rows × 21 columns

In [17]:
data_trump = data_trump.dropna()
data_trump = data_trump[['created_at', 'tweet']]
data_trump.rename(columns={'created_at': 'Timestamp', 'tweet': 'Text'}, inplace=True)
data_trump.to_csv("./Election data/data_trump.csv", index=None)
In [18]:
txtm_trump = pd.read_csv('./Election data/data_trump.csv')
txtm_trump['Text'] = txtm_trump['Text'].astype(str)
print(txtm_trump.shape)
txtm_trump.head()
(189282, 2)
Out[18]:
Timestamp Text
0 2020-10-15 00:00:02 #Trump: As a student I used to hear for years,...
1 2020-10-15 00:00:08 You get a tie! And you get a tie! #Trump ‘s ra...
2 2020-10-15 00:00:25 In 2020, #NYPost is being #censorship #CENSORE...
3 2020-10-15 00:00:26 #Trump #PresidentTrump #Trump2020LandslideVict...
4 2020-10-15 00:00:31 @Susan_Hutch @JoeBiden #Ukraine @RealDonaldTru...
In [19]:
%%time
i=0

compval2 = [ ]
while (i<len(txtm_trump)):

    k = analyser.polarity_scores(txtm_trump.iloc[i]['Text'])
    compval2.append(k['compound'])

    i = i+1

compval2 = np.array(compval2)

# The result of the operation is added to our data set
txtm_trump['VADER score'] = compval2
len(compval2)
Out[19]:
189282
In [20]:
%%time
i = 0

predicted_value = [ ]

# The same thresholds as for the Biden set
while(i<len(txtm_trump)):
    if ((txtm_trump.iloc[i]['VADER score'] >= 0.7)):
        predicted_value.append('positive')
        i = i+1
    elif ((txtm_trump.iloc[i]['VADER score'] > 0) & (txtm_trump.iloc[i]['VADER score'] < 0.7)):
        predicted_value.append('neutral')
        i = i+1
    elif ((txtm_trump.iloc[i]['VADER score'] <= 0)):
        predicted_value.append('negative')
        i = i+1
In [21]:
# Add predicted values to our data set and separate days from the 'Timestamp' column
txtm_trump['predicted sentiment'] = predicted_value
txtm_trump['Date'] = txtm_trump['Timestamp'].apply(lambda s: s.split()[0])

# We save our data set with evaluated scores for later
txtm_trump.to_csv("./Election data/data_trump_sent.csv", index=None)
In [22]:
txtm_trump_sent = pd.read_csv('./Election data/data_trump_sent.csv')
txtm_trump_sent.head()
Out[22]:
Timestamp Text VADER score predicted sentiment Date
0 2020-10-15 00:00:02 #Trump: As a student I used to hear for years,... 0.4738 neutral 2020-10-15
1 2020-10-15 00:00:08 You get a tie! And you get a tie! #Trump ‘s ra... 0.0000 negative 2020-10-15
2 2020-10-15 00:00:25 In 2020, #NYPost is being #censorship #CENSORE... -0.4137 negative 2020-10-15
3 2020-10-15 00:00:26 #Trump #PresidentTrump #Trump2020LandslideVict... 0.0000 negative 2020-10-15
4 2020-10-15 00:00:31 @Susan_Hutch @JoeBiden #Ukraine @RealDonaldTru... -0.5386 negative 2020-10-15
In [23]:
txtm_trump_sent.groupby('predicted sentiment').size().plot(kind='bar')
Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fa3377fb310>

We can observe that negative tweets also make up the majority of tweets with the hashtag #Trump. The high share of negative tweets may result from the thresholds at which the VADER score is interpreted; but how do the two groups of data compare with each other?

In [24]:
boxplot_df = txtm_biden_sentiment.groupby('predicted sentiment').size()
boxplot_df = pd.DataFrame(boxplot_df)
boxplot_df['Trump'] = txtm_trump_sent.groupby('predicted sentiment').size()
boxplot_df.rename(columns={0: 'Biden'}, inplace=True)
In [25]:
boxplot_df.plot(kind='bar',title='Sentimental Analysis for All Tweets', figsize=(10,7))
Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fa337814210>

It is now more evident that, in absolute numbers, more #Trump tweets were classified as negative and neutral, while the contribution of positive tweets is smaller than for #Biden.

In [26]:
txtm_trump_temp = txtm_trump_sent.groupby(['Date','predicted sentiment']).count().reset_index()
txtm_trump_temp = txtm_trump_temp[['Date', 'predicted sentiment', 'Timestamp']]
txtm_trump_temp.rename(columns={'Date': 'Date', 'predicted sentiment': 'Sentiment','Timestamp': 'Count'}, inplace=True)

txtm_trump_temp = np.array(txtm_trump_temp)
txtm_trump_temp = pd.DataFrame(txtm_trump_temp[3:77])
txtm_trump_temp.rename(columns={0: 'Date', 1: 'Sentiment',2: 'Count'}, inplace=True)
txtm_trump_temp.head(10)
Out[26]:
Date Sentiment Count
0 2020-10-15 negative 2875
1 2020-10-15 neutral 930
2 2020-10-15 positive 303
3 2020-10-16 negative 3856
4 2020-10-16 neutral 1413
5 2020-10-16 positive 455
6 2020-10-17 negative 2601
7 2020-10-17 neutral 888
8 2020-10-17 positive 240
9 2020-10-18 negative 2598
In [27]:
# a crosstabbed data structure is needed for this type of plot
p = np.array(txtm_trump_temp['Date'])
q = np.array(txtm_trump_temp['Sentiment'])
r = np.array(txtm_trump_temp['Count'])

crossed_data_trump = pd.crosstab(p, columns=q, values=r, aggfunc='sum', rownames=['Date'], colnames=['Sentiment'])
crossed_data_trump=crossed_data_trump.dropna()

crossed_data_trump['sum'] = crossed_data_trump['negative']+crossed_data_trump['neutral']+crossed_data_trump['positive']
crossed_data_trump['pos_per'] = crossed_data_trump['positive']/crossed_data_trump['sum']
crossed_data_trump['neg_per'] = crossed_data_trump['negative']/crossed_data_trump['sum']
crossed_data_trump['neu_per'] = crossed_data_trump['neutral']/crossed_data_trump['sum']
crossed_data_trump['date'] = crossed_data_trump.index

Basic indicators per day have been calculated and are shown below:

  • Number of positive Tweets according to the calculated VADER indicator,
  • Number of negative Tweets according to the calculated VADER indicator,
  • Number of neutral Tweets,
  • Percentage of positive to total Tweets,
  • Percentage of negative tweets to total,
  • Percentage of neutral tweets.
In [28]:
crossed_data_trump = np.array(crossed_data_trump)
txtm_trump_df = pd.DataFrame(crossed_data_trump)
txtm_trump_df.rename(columns={0: 'Negative', 1: 'Neutral',2: 'Positive',3: 'Sum', 4: 'Posper',5: 'Negper',6: 'Neuper',7:'Date'}, inplace=True)

txtm_trump_df.head(5)
Out[28]:
Negative Neutral Positive Sum Posper Negper Neuper Date
0 2875 930 303 4108 0.0737585 0.699854 0.226388 2020-10-15
1 3856 1413 455 5724 0.0794899 0.673655 0.246855 2020-10-16
2 2601 888 240 3729 0.0643604 0.697506 0.238134 2020-10-17
3 2598 973 373 3944 0.094574 0.658722 0.246704 2020-10-18
4 3048 1046 298 4392 0.0678506 0.693989 0.23816 2020-10-19
In [29]:
df = txtm_trump_df[['Date','Posper','Negper','Neuper']]


# use a separate name so matplotlib's `plt` is not shadowed
ax = df.plot.area(x='Date', y=['Posper','Negper','Neuper'], colormap='winter', figsize=(10,7))

ax.axvline(x=20, color='r', linestyle='--', lw=3, label='6 November - Swing States identifiable')
ax.axvline(x=17, color='y', linestyle='--', lw=3, label='3 November - US Election Day')

ax.set_xlabel('')
ax.set_ylabel('Predicted Sentiment Share')

ax.legend(bbox_to_anchor=(1.0, 1), loc='upper left')
Out[29]:
<matplotlib.legend.Legend at 0x7fa3245383d0>

From the chart, one can observe some increase in negative tweets and, above all, a decrease in neutral posts, perhaps a result of signs that Joe Biden might lose; there is no significant fall in positive tweets.

On the other hand, on November 6, when the 'Swing States' were visible, the share of negative tweets rose to its highest level in the entire period. Could a reversal of sentiment be inferred from this moment?

We observe the total opposite of the reaction observed for #Biden.

Most frequent words per hashtag

Let's check which words appeared most often in the posted tweets, depending on whether they contained #Biden or #Trump. After the first analysis, we expect to see more words associated with negativity on the #Trump side of the axis, while the words on the #Biden side should be more emotionally neutral.

Plot data preparation

In [30]:
import pandas as pd 
import numpy as np 
%matplotlib inline 
import spacy
import en_core_web_sm
import nltk
#nltk.download('punkt')
import nltk.corpus
nlp = en_core_web_sm.load()
%matplotlib inline
import scattertext as st
import re, io
import os, pkgutil, json, urllib

from urllib.request import urlopen
from IPython.display import IFrame
from IPython.core.display import display, HTML
from scattertext import CorpusFromPandas, produce_scattertext_explorer
display(HTML("<style>.container { width:98% !important; }</style>"))
from nltk.corpus import stopwords 
#nltk.download('stopwords')
from nltk.tokenize import word_tokenize
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
from nltk.stem import PorterStemmer 
from nltk.tokenize.treebank import TreebankWordDetokenizer
from pprint import pprint
from scipy.stats import rankdata, hmean, norm

def filterStop_words(words):
    # English and German stop lists, since the corpus mixes both languages
    stop_words = set(stopwords.words(['english','german']))
    # note: the membership test is case-sensitive, so capitalized
    # stop words (e.g. 'I', 'Are') pass through the filter
    return [w for w in words if w not in stop_words]
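A quick check of what stop-word filtering does, using a toy stop list standing in for `stopwords.words(['english','german'])` and a simple whitespace split (the notebook uses `word_tokenize`). Unlike the notebook's case-sensitive filter, this sketch lowercases tokens before the membership test:

```python
# toy stop list invented for illustration
stop_words = {'i', 'had', 'the', 'of', 'und', 'der', 'die'}

tokens = "I wish Biden had the abilities of Obama".split()
# lowercase before the test so capitalized stop words are caught too
filtered = [w for w in tokens if w.lower() not in stop_words]
print(filtered)
```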
In [31]:
txtm_trump = pd.read_csv('./Election data/data_trump.csv')
print(txtm_trump.shape)
#txtm_trump['Text'] = txtm_trump['Text'].astype(str)
txtm_trump.head()
(189282, 2)
Out[31]:
Timestamp Text
0 2020-10-15 00:00:02 #Trump: As a student I used to hear for years,...
1 2020-10-15 00:00:08 You get a tie! And you get a tie! #Trump ‘s ra...
2 2020-10-15 00:00:25 In 2020, #NYPost is being #censorship #CENSORE...
3 2020-10-15 00:00:26 #Trump #PresidentTrump #Trump2020LandslideVict...
4 2020-10-15 00:00:31 @Susan_Hutch @JoeBiden #Ukraine @RealDonaldTru...
In [32]:
txtm_biden = pd.read_csv('./Election data/data_joebiden.csv')
print(txtm_biden.shape)
txtm_biden.head()
(155955, 2)
Out[32]:
Timestamp Text
0 2020-10-15 00:00:25 In 2020, #NYPost is being #censorship #CENSORE...
1 2020-10-15 00:01:23 Comments on this? "Do Democrats Understand how...
2 2020-10-15 00:01:57 @RealJamesWoods #BidenCrimeFamily #JoeBiden #H...
3 2020-10-15 00:02:05 #Trump #Obama #Clinton #Biden\n\n#ManWomanPers...
4 2020-10-15 00:02:06 Come on @ABC PLEASE DO THE RIGHT THING. Move t...
In [33]:
# sample to shorten execution time and assure equal subpopulations
trump_plot_data = txtm_trump.sample(30000)
trump_plot_data['Candidate'] = "Trump"
trump_plot_data = trump_plot_data[['Text','Candidate']]

biden_plot_data = txtm_biden.sample(30000)
biden_plot_data['Candidate'] = "JoeBiden"
biden_plot_data = biden_plot_data[['Text','Candidate']]
In [34]:
plot_data_union = pd.concat([trump_plot_data, biden_plot_data])
plot_data_union
Out[34]:
Text Candidate
11020 #USElectionTradeFacts Bei ihren #Townhalls st... Trump
157614 150 000 bulletins de militaires du monde entie... Trump
40434 #Trump “did lie like Pinocchio” - @jaketapper\... Trump
93046 @CHIZMAGA #Biden is fighting for #China 🇨🇳..#T... Trump
98971 View: What is at stake for India in the US ele... Trump
... ... ...
146731 Are you start your business marketing ?Click h... JoeBiden
19723 I wish Biden had half the abilities of Obama i... JoeBiden
50179 @LisaMoraitis1 @rosepetals it's quite probable... JoeBiden
57975 Wall Street has been warming to the idea that ... JoeBiden
153645 @manuwiki Yes Manuela, these news are even bet... JoeBiden

60000 rows × 2 columns

In [35]:
import re

# Regex cleaning

plot_data_union['Text'] = plot_data_union['Text'].astype(str)

plot_data_union['test'] = plot_data_union.apply(
    lambda row: re.sub(r'[^a-zA-Z ]', '',row['Text']),
    axis=1
    )

plot_data_union['test1'] = plot_data_union.apply(
    lambda row: re.sub(r'.rump', '',row['test']),
    axis=1
    )
plot_data_union['test2'] = plot_data_union.apply(
    lambda row: re.sub(r'.iden', '',row['test1']),
    axis=1
    )
plot_data_union['test3'] = plot_data_union.apply(
    lambda row: re.sub(r'.oe', '',row['test2']),
    axis=1
    )
plot_data_union['Clean'] = plot_data_union.apply(
    lambda row: re.sub(r'.onald', '',row['test3']),
    axis=1
    )
plot_data_union = plot_data_union.drop(['test', 'test1', 'test2', 'test3'], axis = 1)
plot_data_union
Out[35]:
Text Candidate Clean
11020 #USElectionTradeFacts Bei ihren #Townhalls st... Trump USElectionTradeFacts Bei ihren Townhalls stel...
157614 150 000 bulletins de militaires du monde entie... Trump bulletins de militaires du monde entiernon e...
40434 #Trump “did lie like Pinocchio” - @jaketapper\... Trump did lie like Pinocchio jaketapperUSPretialDe...
93046 @CHIZMAGA #Biden is fighting for #China 🇨🇳..#T... Trump CHIZMAGA is fighting for China is fighting f...
98971 View: What is at stake for India in the US ele... Trump View What is at stake for India in the US elec...
... ... ... ...
146731 Are you start your business marketing ?Click h... JoeBiden Are you start your business marketing Click he...
19723 I wish Biden had half the abilities of Obama i... JoeBiden I wish had half the abilities of Obama in spe...
50179 @LisaMoraitis1 @rosepetals it's quite probable... JoeBiden LisaMoraitis rosepetals its quite probable I l...
57975 Wall Street has been warming to the idea that ... JoeBiden Wall Street has been warming to the idea that ...
153645 @manuwiki Yes Manuela, these news are even bet... JoeBiden manuwiki Yes Manuela these news are even bette...

60000 rows × 3 columns
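The chained `re.sub` calls above can be collapsed into one helper; a sketch with an alternation pattern (note that fragments like `.oe` also eat matching substrings inside unrelated words, e.g. 'shoe' would lose 'hoe'):

```python
import re

def clean_tweet(text):
    # keep letters and spaces only, mirroring the first re.sub step
    text = re.sub(r'[^a-zA-Z ]', '', text)
    # drop candidate-name fragments (Trump/Biden/Joe/Donald, each with
    # one arbitrary leading character), as in the chained steps
    return re.sub(r'.rump|.iden|.oe|.onald', '', text)

print(clean_tweet("#Trump rally!"))
```

A single pass over the alternation is not strictly identical to four sequential passes, but for these patterns it yields the same cleaned text on the examples shown.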

In [36]:
plot_data_union['Parsed'] = plot_data_union.Clean.apply(word_tokenize)

plot_data_union['Filtered'] = plot_data_union.Parsed.apply(filterStop_words)

plot_data_union['deTokenized'] = plot_data_union.Filtered.apply(TreebankWordDetokenizer().detokenize)

plot_data_union = plot_data_union.drop(['Clean', 'Parsed', 'Filtered'], axis = 1)
plot_data_union
Out[36]:
Text Candidate deTokenized
11020 #USElectionTradeFacts Bei ihren #Townhalls st... Trump USElectionTradeFacts Bei Townhalls stellten Fr...
157614 150 000 bulletins de militaires du monde entie... Trump bulletins de militaires monde entiernon encore...
40434 #Trump “did lie like Pinocchio” - @jaketapper\... Trump lie like Pinocchio jaketapperUSPretialDebateUS...
93046 @CHIZMAGA #Biden is fighting for #China 🇨🇳..#T... Trump CHIZMAGA fighting China fighting American people
98971 View: What is at stake for India in the US ele... Trump View What stake India US elections httpstcoBep...
... ... ... ...
146731 Are you start your business marketing ?Click h... JoeBiden Are start business marketing Click httpstcomhK...
19723 I wish Biden had half the abilities of Obama i... JoeBiden I wish half abilities Obama speaking abilities...
50179 @LisaMoraitis1 @rosepetals it's quite probable... JoeBiden LisaMoraitis rosepetals quite probable I live ...
57975 Wall Street has been warming to the idea that ... JoeBiden Wall Street warming idea precy would bullish s...
153645 @manuwiki Yes Manuela, these news are even bet... JoeBiden manuwiki Yes Manuela news even better election...

60000 rows × 3 columns

In [37]:
plot_data_union.to_csv("./Election data/plot_data_union.csv", index=None)

Most frequent word PLOT

In [38]:
plot_data_temp = pd.read_csv("./Election data/plot_data_union.csv")
plot_data_temp['deTokenized'] = plot_data_temp['deTokenized'].astype(str)
In [39]:
plot_data_temp['Final'] = plot_data_temp.deTokenized.apply(nlp)
plot_data_temp
Out[39]:
Text Candidate deTokenized Final
0 #USElectionTradeFacts Bei ihren #Townhalls st... Trump USElectionTradeFacts Bei Townhalls stellten Fr... (USElectionTradeFacts, Bei, Townhalls, stellte...
1 150 000 bulletins de militaires du monde entie... Trump bulletins de militaires monde entiernon encore... (bulletins, de, militaires, monde, entiernon, ...
2 #Trump “did lie like Pinocchio” - @jaketapper\... Trump lie like Pinocchio jaketapperUSPretialDebateUS... (lie, like, Pinocchio, jaketapperUSPretialDeba...
3 @CHIZMAGA #Biden is fighting for #China 🇨🇳..#T... Trump CHIZMAGA fighting China fighting American people (CHIZMAGA, fighting, China, fighting, American...
4 View: What is at stake for India in the US ele... Trump View What stake India US elections httpstcoBep... (View, What, stake, India, US, elections, http...
... ... ... ... ...
59995 Are you start your business marketing ?Click h... JoeBiden Are start business marketing Click httpstcomhK... (Are, start, business, marketing, Click, https...
59996 I wish Biden had half the abilities of Obama i... JoeBiden I wish half abilities Obama speaking abilities... (I, wish, half, abilities, Obama, speaking, ab...
59997 @LisaMoraitis1 @rosepetals it's quite probable... JoeBiden LisaMoraitis rosepetals quite probable I live ... (LisaMoraitis, rosepetals, quite, probable, I,...
59998 Wall Street has been warming to the idea that ... JoeBiden Wall Street warming idea precy would bullish s... (Wall, Street, warming, idea, precy, would, bu...
59999 @manuwiki Yes Manuela, these news are even bet... JoeBiden manuwiki Yes Manuela news even better election... (manuwiki, Yes, Manuela, news, even, better, e...

60000 rows × 4 columns

In [40]:
corpus = st.CorpusFromParsedDocuments(plot_data_temp, category_col='Candidate', parsed_col='Final').build()
In [41]:
# Visualize the chart
html = produce_scattertext_explorer(corpus,
                                    category='Trump',
                                    category_name='D.Trump',
                                    not_category_name='J.Biden',
                                    width_in_pixels=1000,
                                    minimum_term_frequency=50,
                                    transform=st.Scalers.log_scale_standardize)

file_name = 'Election2020ScattertextScale.html'
open(file_name, 'wb').write(html.encode('utf-8'))
IFrame(src=file_name, width = 1200, height=700)
Out[41]:

Scattertext visualizes how language differs between document classes. The data has been processed in a way that lets you inspect the most frequently used words depending on whether they belong to #Trump or #Biden.

Final Findings

Based on the sentiment analysis using the VADER method, each tweet could be assigned a predicted sentiment: neutral, positive or negative. The thresholds selected by the Author suggest a strong preponderance of tweets predicted as negative, but this may result, among other things, from the high threshold fixed for the positive class. The fact that the share of negative tweets in the total is higher for #Trump than for #Biden, the Author leaves open to interpretation.

When the timeline is superimposed on the sentiment analysis, visual inspection suggests that external factors may influence the share of negative tweets in the total. Marking the date of the United States presidential election and the date on which the 'Swing States' became apparent is only a visual aid of the Author; it is not claimed that these events have an unequivocal influence on the studied variable.

Changes in social relations and aggressiveness in communication over time may highlight the use of political narratives along the timeline of the presidential election.

Bearing in mind the work of other authors, who argue that public opinion can be directed and is guided by the behavior and opinions of politicians, one could try to show that the moods of each party's voters are analogous or correlated with the moods within the parties themselves.

In [ ]: